Detecting SNP markers discriminating horse breeds by deep learning

Assigning an individual to its true population of origin with a small panel of discriminant SNP markers is one of the most important practical applications of genomic data. The aim of this study was to evaluate the potential of different Artificial Neural Network (ANN) approaches, namely Deep Neural Networks (DNN) and the Garson and Olden methods, for selecting informative SNP markers from high-throughput genotyping data that can trace the true breed of unknown samples. A total of 795 animals from 37 breeds, genotyped with the Illumina SNP 50k Bead chip, were used in the current study, and principal component analysis (PCA), log-likelihood ratios (LLR) and Neighbor-Joining (NJ) were applied to assess the performance of the different assignment methods. The results revealed that the DNN, Garson, and Olden methods are able to assign individuals to their true populations with 4270, 4937, and 7999 SNP markers, respectively. PCA was used to determine how the animals were allocated to groups using all genotyped markers available on the 50k Bead chip and the subsets of SNP markers identified by the different methods. The results indicated that all SNP panels are able to assign individuals to their true breeds. Assessing the success of genetic assignment at different LLR levels showed that a success rate of 70% was reached by the three methods with 110, 208, and 178 markers for the DNN, Garson, and Olden methods, respectively. The results also showed that the DNN performed better than the other two approaches, achieving 93% accuracy at the most stringent threshold. Finally, the identified SNPs were successfully used in independent out-group breeds consisting of 120 individuals from eight breeds, and the results indicated that these markers are able to correctly allocate all unknown samples to their true population of origin. Furthermore, the NJ tree of allele-sharing distances on the validation dataset showed that the DNN has a high potential for feature selection. In general, the results of this study indicated that the DNN technique represents an efficient strategy for selecting a reduced pool of highly discriminant markers for assigning individuals to their true population of origin.

Artificial Neural Networks (ANNs) are information-processing tools built to mimic the behavior of the biological nervous system 13 . Researchers have pointed out shortcomings of ANNs, including the complexity of the analysis, computational cost, and run time. However, the high prediction accuracy of ANNs compensates for these drawbacks to a great extent. Deep Neural Networks (DNN) have been employed to analyze biological data 14,15 and have many applications in feature abstraction and selection 16,17 . DNNs have been used to construct many biological prediction models 18 , but their power for feature selection has been ignored for individual discrimination.
ANNs have recently been applied as a powerful statistical modeling technique to many kinds of biological data, especially in the animal sciences 19,20 . Fernández, et al. 21 indicated that ANNs are suitable for time series data, both for weekly milk prediction and for clustering individuals in goat flocks. Ince and Sofu 22 used an ANN with the back-propagation algorithm to predict sheep milk yield.
In this study, feature selection (FS) methods based on ANNs were compared for discriminating among different horse breeds and for assigning new individuals to their breed. In a GWAS analysis, each SNP is tested separately for significance. Such an analysis identifies significant SNP markers, but the relationships between them are ignored. A network approach, in contrast, monitors all SNPs simultaneously, which is more reliable and can lead to better results.
To obtain the best results, allele dosage was used as the input to the ANNs, which is a completely unbiased estimation. The Garson (weights) algorithm shows unstable behavior in the analysis, which can be considered a weakness 23 . Unlike most studies, Olden, et al. 24 examined the performance of the Garson algorithm for variable selection on simulated data and found that it had the lowest efficiency among the algorithms studied. Ibrahim 25 showed that the Olden and Garson methods had the weakest results. In contrast, the results of Fischer 26 revealed that the Garson algorithm has a higher degree of stability in modeling non-linear relationships. Additionally, other studies have used the Garson and Olden algorithms, which are only applicable to ANNs with a single hidden layer.
To the best of our knowledge, the potential of feature selection by ANN approaches for assigning individuals to horse breeds has not been investigated. We examined whether ANNs can be used as a tool for tackling the curse of dimensionality in SNP data. We compared the DNN with the Garson and Olden methods, which are described briefly, to obtain the relative importance of variables (SNP markers). While the DNN is an ANN with multiple hidden layers, the other two methods are compatible only with a single hidden layer. This paper is one of the first studies to determine discriminant SNPs on a large scale using ANN approaches. We conducted this study to find distinct SNP markers that reduce the dimensions of SNP panels, and to compare different variable selection methods such as Garson and Olden through the ANN approach.

Results and discussion
Feature selection: comparison between three approaches. In the current research, we used three feature selection (FS) methods, namely Olden, Garson, and DNN. Neural networks are commonly referred to as powerful and efficient statistical modeling techniques by various researchers 25 . Many studies have compared different FS methods [26][27][28][29] . The selection criterion for variables in the DNN structure was the absolute value of the first hidden layer connection weights, which were treated as regression coefficients. According to the DNN procedure, 4270 SNP markers were selected for the rest of the analysis. The Garson and Olden algorithms led to the selection of 4937 and 7999 SNP markers, respectively, for further analysis. A larger number of SNP tags was chosen for the Olden algorithm because of the low transparency of its PCA plot. It should be mentioned that increasing the number of tags further did not increase transparency, which could be because there is no linear relationship between the number of SNPs and PCA plot transparency. Moreover, simply increasing the number of markers is not a useful route to improvement unless the marker allele frequencies differ across subpopulations.
After the selection process, all SNP markers were sorted by their calculated coefficients. The 460 top-ranked SNPs of each approach were selected, and the resulting SNP subsets were compared with each other to find the common markers (Table 1). Table 1 presents the SNPs shared among the top 460 SNP markers of each method. It indicates that all three methods had at least a 34% overlap (the average number of common SNPs is 158).
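The comparison of top-ranked marker lists described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `top_k_overlap` and `spearman`, the simulated score vectors, and the choice to rank markers in descending order of their coefficient are all assumptions made for the example, and the Spearman correlation is computed without tie correction.

```python
import numpy as np

def top_k_overlap(scores_a, scores_b, k):
    """Return the marker indices found in the top-k of both score vectors."""
    # Rank in descending order of the coefficient (assumption for this sketch).
    top_a = set(np.argsort(-np.asarray(scores_a, dtype=float))[:k].tolist())
    top_b = set(np.argsort(-np.asarray(scores_b, dtype=float))[:k].tolist())
    return sorted(top_a & top_b)

def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Simulated importance scores for two hypothetical methods over 5000 markers.
rng = np.random.default_rng(0)
n_markers = 5000
base = rng.normal(size=n_markers)
scores_m1 = base + 0.1 * rng.normal(size=n_markers)
scores_m2 = base + 0.1 * rng.normal(size=n_markers)

common = top_k_overlap(scores_m1, scores_m2, k=460)
rho = spearman(scores_m1[common], scores_m2[common])
print(len(common), round(rho, 3))
```

Applied to the three real coefficient vectors, the pairwise counts and correlations would fill the lower and upper triangles of a table such as Table 1.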
Regarding Table 1, the lowest number of common SNP markers was found between the DNN and Garson approaches. This could be owing to the weights of the first layer in the two approaches. The largest number of common SNP markers was found between Garson and Olden. This shows that Garson and Olden have similar mechanisms for feature selection, using the NN's weights in the input-hidden and hidden-output layers. The Spearman correlation for the coefficients of common markers indicated a strong relationship between the Garson and Olden methods (98.10%). The correlation between the DNN and Garson methods was 43.1%, which is consistent with the number of common SNP markers. In general, most studies have widely used the Olden and Garson approaches. The results of Olden, et al. 24 revealed that the Olden method was the best overall methodology for processing and identifying variable importance in the neural network, especially when the inputs had a weak or strong correlation with the output. Fischer 26 compared the Olden and Garson methods and reported that the results obtained by the Garson method are preferable and more stable than those obtained by the Olden method for nonlinear relationships. Findings from that study showed that ranks obtained by the Garson approach may be more reliable than those of the Olden method, especially when those ranks are used for modeling nonlinear data such as positive and negative quadratics and interactive data. The results of these studies indicated that the Olden (Connection weights) method performed excellently under different assumptions, whereas Garson (Weights), as the ancestor of the weight-based methods, behaved inconsistently across these studies.

Table 1. Comparison among three feature selection methods based on the top 460 selected SNP markers. The upper triangle represents the Spearman correlation among the three methods for ranking markers, while the lower triangle represents the number of common tags between feature selection methods.
All the studies mentioned used simulated or ecological data with fewer than 20 input variables. Both the Olden and Garson algorithms use the input-hidden and hidden-output connection weights for calculating the importance of variables. Linear regression modeling has been used as a control method on real datasets for evaluating the significance of inputs in some studies 23,25 , while others have used simulated data that mostly contained linear 24,28 or semi-linear relationships 27 . However, the DNN approach could raise the performance and efficiency of the artificial neural network when the system is confronted with a large number of input variables (for example, genomic data of global equine breeds).

Feature selection: a comparison based on PCA analysis. To assess the degree of divergence among samples, principal component analysis (PCA) was first applied to determine how the animals were allocated to groups 30 . The coefficients for the SNP markers were obtained step by step with reference to the original PCA plot, in the manner of a numerical analysis procedure. In other words, after choosing a new coefficient, the PCA plot was drawn, and the breed distinction was compared with the main PCA plot created by the 50K SNP marker panel (Fig. 1). After marker selection and identification of the marker subsets, PCA analysis was performed using all three SNP subsets and the total 50k SNPs available on the SNP chip (Figures S1 (DNN), S2 (Garson), and S3 (Olden)). The results indicated an excellent performance of PCA in separating individuals into distinct groups. PCA analysis identified the two subpopulations of Thoroughbred (TB_UK & TB_US) as one breed, and a similar result was obtained for Standardbred (STBDNor & STBDUS). In Fig. 1, some breeds overlapped, but according to the symbols of each breed, these breeds are properly distinct from each other. Some breeds, such as Clyd, Shire, Shet, Ice, Mini, and TB (UK-US), were located in the corners of the PCA plot, which is due to the geographic boundaries of their countries (Table 5). In other words, these breeds belong to countries that have common borders. As a result, they might have exchanged more genetic resources with each other. Although STBD (including Nor and US) overlapped with the Paint and Quarter breeds, they were completely separated by likelihood assessment. Asian breeds (AKTK, ARR, and CSP) were located near the center of the PCA plot and overlapped with Central European Breeds (CEB). This highlights that Asian breeds have many characteristics in common with CEB. The PCA analysis was performed for each method using the selected SNP markers (Figs. S1 (DNN), S2 (Garson), and S3 (Olden)). The breed distinction is in good agreement with the main PCA plot created by the 50K SNP markers (Fig. 1).
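The PCA step used above to compare the full panel with a marker subset can be illustrated with a short sketch. It assumes genotypes coded as 0/1/2 allele counts and computes the principal components by singular value decomposition of the centred matrix; the function name `pca_scores` and the simulated two-population data are hypothetical and do not reproduce the analysis behind Fig. 1.

```python
import numpy as np

def pca_scores(genotypes, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: (n_individuals, n_markers) array of 0/1/2 allele counts.
    Returns the component scores and the fraction of variance explained.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                      # centre each marker
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Two simulated populations with strongly different allele frequencies.
rng = np.random.default_rng(1)
p_a = rng.uniform(0.1, 0.9, size=200)
p_b = 1.0 - p_a
G = np.vstack([
    rng.binomial(2, p_a, size=(20, 200)),       # population A
    rng.binomial(2, p_b, size=(20, 200)),       # population B
])

scores, explained = pca_scores(G)
print(scores[:20, 0].mean(), scores[20:, 0].mean(), explained)
```

Running the same function on the columns of a selected marker subset and plotting the first two score columns gives the kind of side-by-side comparison described in the text.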

Assessment of different methods and the number of SNP(s) for assignment.
We estimated the likelihood of assigning the 795 individual genotypes to their known origins (or breeds) using the approach of Paetkau, et al. 31 . In general, all three feature selection methods assigned most of the individuals to the right population, although one particular breed (Shire) had at least one failed assignment with each method. This resulted in a 9% reduction in the potential of the assignment procedure. Two individuals in the Shire breed failed in all subsets. Red arrows indicate these individuals in Fig. 2.
Based on the assignment analysis and the LLR values, one of the failed individuals was recognized as the Belgian breed by all three methods, and the other was recognized as different breeds such as Paint, Quarter, Swiss warmblood, and Thoroughbred-US. With all three methods, the first individual had 97.30% accuracy for assignment to the correct breed (Shire). With the DNN and Olden approaches, the second individual also had 91.89% accuracy for assignment to the right breed. These failures might be due to hybrid or crossbred parentage. There were two Shire individuals in the center of the PCA plot (Fig. 2); the assignment method indicated that they belong to their breed (green arrows). Fig. 3 shows the correctness plots for the three feature selection algorithms at various stringency levels.
As shown in Fig. 3, the three methods behaved differently in the success percentage of genetic assignment. The DNN had a higher success rate in selecting the correct animal breed than the other methods. The number of SNP markers required to correctly assign an unknown animal to its exact breed/origin at different threshold levels (90%, 95%, and 98%) is shown for the DNN, Garson, and Olden methods in Table 2.
We calculated the percentage of individuals correctly assigned for different numbers of SNP markers. The performance of each approach was tested at four different LLR levels. We found that DNN performed better than the other two approaches by achieving 93% accuracy at the most stringent threshold (LLR > 4) (Table 3). In this analysis, the Garson method did not perform well. The results revealed that the DNN outperformed the other methods with fewer SNP markers. Generally, about 500 discriminant SNP markers enabled us to assign new individuals to the right groups with the different methods. Some issues complicate the comparison of the results of this study with others. First, many previous studies used another type of marker with only a limited number of tags [32][33][34][35] . Second, several studies used different methods 36 . Maudet, et al. 32 found that, using 23 microsatellite loci, they could assign more than 90% of individuals to their breed. Negrini, et al. 37 used a limited set of available SNP markers for individual assignment. Wilkinson, et al. 38 aimed to determine the minimum number of SNP markers (in the range from 60 to 140) for assigning individuals in 17 bovine breeds.

Model validation
PCA and LLR analysis for validation data. We used a separate dataset to test the model. First, we applied PCA to examine the relationships among the breeds, as was done for the training dataset (Fig. 4).
In Fig. 4, the Quarter and Warmblood breeds overlap slightly. We identified and extracted the SNP markers selected by the three feature selection methods (from the 50K panel) in the evaluation dataset. The common extracted SNP markers were retained for later analysis. We have isolated and extracted 839 (Fig. S4-DNN) Table 4.
The results of this section revealed that all three artificial neural networks performed excellently. The Garson method with the minimum number of markers (fifteen) had 60% accuracy, which may be due to the low number of animals and the differences in origin within the test data, because there are significant differences between Switzerland, France, and England (Europe) and the countries of the Middle East and the Americas (Asia and America).
When only one dataset is used, kinship relationships among individuals may be observed. If all individuals are sampled from one herd, kinship relationships are practically inevitable. Therefore, using new data from other sources reduces the probability of kinship among individuals. If unknown or novel information is presented to the network, the errors should be minimal. The previously obtained results of the network were reliable enough for the DNN to infer the right class of novel information precisely. In this case (DNN), the system has considerable power and success in correctly determining the essential features.
Neighbor-Joining tree of allele-sharing distances for validation data. For a better understanding, we used the Neighbor-Joining tree of allele-sharing distances on the validation dataset. Neighbor-Joining analysis performs better than PCA on topics such as breed-level differentiation, the intermingling of breeds, outliers, and genetic isolation. First, we analyzed the whole genomic data (32419 SNP markers, 120 horses) to show the breed-level differentiation in the validation data (Fig. S7).
Then, the Neighbor-Joining analysis was done for each obtained dataset (Fig. S8 (DNN), S9 (Garson), S10 (Olden)) to demonstrate the breed distinction in comparison to the whole data. In Fig. S7, despite the low number of SNP markers, all breeds except two groups (Quarter Horse and Warmblood) were in their real groups. It is important to consider that these two breeds may show an unusual overlap because of the low number of markers.
Fig. S8 was drawn using the markers selected by the DNN. It is noteworthy that the classification of individuals is mostly successful, and there is no significant overlap between breeds. The Neighbor-Joining plot drawn from the markers selected by the Garson method (Fig. S9) was of poor quality in terms of the classification of individuals. In Fig. S9, there was a great deal of unusual overlap between the breeds, and only the Thoroughbred was identified as a pure breed, due to the small number of individuals. The number of outsiders in the results of this dataset was very high (red arrows). The Neighbor-Joining plot for the markers selected by the Olden method is shown in Fig. S10. In a way, its plot was promising. Perhaps the only disadvantage of the Olden method compared to the other two is that, despite the high number of SNP markers, two individuals (Arabian-3 and QuarterHorse-1) were still identified as outsiders.
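The allele-sharing distances underlying the Neighbor-Joining trees above can be sketched as follows. The text does not give the exact distance formula, so this example assumes one common definition for 0/1/2-coded genotypes: at each locus the distance between two individuals is |g_i - g_j| / 2, averaged over loci. The function name `allele_sharing_distance` and the toy genotypes are hypothetical, and missing genotypes are assumed to have been removed.

```python
import numpy as np

def allele_sharing_distance(genotypes):
    """Pairwise allele-sharing distance matrix for 0/1/2-coded genotypes.

    Assumed definition: per-locus distance |g_i - g_j| / 2 (0, 0.5 or 1),
    averaged over all loci. Rows are individuals, columns are markers.
    """
    G = np.asarray(genotypes, dtype=float)
    n = G.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        # Compare individual i with every individual, one row at a time,
        # to avoid building an n x n x markers array in memory.
        D[i] = np.abs(G - G[i]).mean(axis=1) / 2.0
    return D

toy = [[0, 1, 2],
       [0, 1, 2],
       [2, 1, 0]]
D = allele_sharing_distance(toy)
print(D)
```

The resulting symmetric matrix is the input a Neighbor-Joining implementation would need; the tree construction itself is not shown here.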

Conclusion
We used the weights of the first hidden layer of the DNN for selecting and ranking variables (SNPs). Artificial neural networks (ANNs) have received a great deal of attention in various scientific fields, given that they are powerful statistical modeling techniques. However, because they provide little insight into the contributions of the input (independent) variables to the prediction process, they have been labeled a "black box" technique. As mentioned earlier, many published studies have sought to clarify the interpretation of the connections between the neurons in an ANN.
The Garson and Olden procedures only work with a single hidden layer and a single output unit, while multiple-layer networks (DNN) do not suffer from these limitations. Regarding the log-likelihood ratio (LLR) for individual assignment, the results of this research revealed that ANN feature selection methods can be used for genomic data, especially for dimension reduction by DNNs. This finding addresses the most critical issue for genetics researchers in dealing with the considerable dimension of data. Researchers can use DNN in the field of animal sciences because of its high performance in breed discrimination. Researchers in genetics and breeding are seeking to reduce the number of biomarkers to find a link between the observed phenotype and these markers. The results of this study showed that the DNN has a high potential for feature selection in genomic data, along with more flexibility in the application of ANNs in the animal sciences. The results also showed that using the connection weights of the first hidden layer in a DNN makes it possible to reach a high level of accuracy for ranking and selecting the variables (SNPs). Another conclusion of this research is that the most critical weights for the output values of every variable in a DNN are the weights in the first hidden layer, because all connection weights of the next layers are functions of the first layer's connection weights. If the three analyses (PCA, LLR, and Neighbor-Joining) achieve the desired results, we obtain the real discriminative features.
It is necessary to point out that the results of this study shed some light on the use of DNNs (especially for pattern recognition) in genetics and breeding. Feature selection in the genetic field, particularly on SNP markers, is in its infancy. The computation time will be reduced significantly. It should also be noted that the DNN increased computing time but decreased the error rate significantly. It can open a new opportunity to extend human insights.
Finally, we think that this will be a fruitful approach for the study of existing domestic populations, such as inferior local breeds and strains in developing countries. In general, the present paper highlighted the importance of variable selection from various points of view, including the socio-economic perspective (for developing a low-cost customized assay for assigning breeds or tracing the origin of animal products derived from diverse species).

Materials and methods
The data for training ANN. A total of 795 animals from 37 breeds of horse populations were genotyped using the Illumina SNP 50k Bead chip (Illumina, San Diego, CA, USA). Petersen et al. 7 have already given a comprehensive description and the necessary details of the data. In summary, Table 5 gives the breed names, the ID of breeds, the geographic origin, minor allele frequency (MAF), heterozygosity, and the number of animals. Genotype data are coded as the number of copies of the reference SNP allele carried, that is, 0 (for AA), 1 (for AB), and 2 (for BB). In the present study, SNPs with a call rate (the proportion of SNP genotypes called) of less than 99% were additionally filtered out to discard missing genotypes 39,40 .
Moreover, the raw predictor variable data (SNP matrix) were used as the input variables of the ANN. Each marker is assumed to represent a mathematical variable that can take only three values (0, 1, and 2).
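The genotype coding and call-rate filter described above can be sketched in a few lines. This is an illustration under stated assumptions: genotype calls arrive as the strings "AA", "AB" and "BB", any other string is treated as a missing call, and the names `encode_genotypes` and `filter_by_call_rate` are hypothetical.

```python
import numpy as np

# Coding used in the text: 0 (AA), 1 (AB), 2 (BB).
CODE = {"AA": 0.0, "AB": 1.0, "BB": 2.0}

def encode_genotypes(calls):
    """Map genotype strings to 0/1/2; anything else becomes NaN (missing)."""
    return np.array([[CODE.get(c, np.nan) for c in row] for row in calls],
                    dtype=float)

def filter_by_call_rate(G, min_call_rate=0.99):
    """Keep markers (columns) whose proportion of called genotypes is high enough."""
    call_rate = 1.0 - np.isnan(G).mean(axis=0)
    keep = call_rate >= min_call_rate
    return G[:, keep], keep

calls = [["AA", "AB", "--"],
         ["BB", "AB", "AA"]]
G = encode_genotypes(calls)
G_filtered, keep = filter_by_call_rate(G, min_call_rate=0.99)
print(G_filtered, keep)
```

With only two individuals in this toy input, the third marker has a call rate of 0.5 and is dropped, leaving a complete 0/1/2 matrix suitable as network input.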

The data for testing and validation methods.
To assess the performance of the ANN methods, learning and evaluation were performed using two separate datasets. The testing dataset contains 120 individuals from eight breeds (Table 6 includes the sample information). All details about the validation data can be found in Schaefer, et al. 41 . Data preprocessing included extracting the SNP markers common to the 50K and 2M panels. This process resulted in the identification of 32K markers, and 14K of these markers remained after quality control (call rate 99%) for further analysis.
ANN model and construction. Artificial neural networks are complex structures built from fundamental units (elements) called neurons 22 . Neurons and their connections create a specific network architecture such as a multilayer perceptron (MLP), self-organizing map (SOM), etc. 13 . For genomic data analysis, we used two types of ANN architecture. The first is a feed-forward multilayer perceptron (DNN) with two hidden layers, and the second is a standard single hidden layer network (ANN) with a back-propagation algorithm for the weight adjustments 42,43 . The architecture of a single hidden layer ANN is shown in Figure 5. The Neural net 44 and Neural Net Tools 45 packages were used in R software (version 3.4.0) 46 to select informative and unique SNP markers within each breed. The Garson and Olden algorithms were applied to the ANN to detect the relative importance of variables for characterizing breed diversity. The large dimension of the SNP panel leads to a stack overflow error in the computing process. De Oña and Garrido 29 proposed using a set of neural networks instead of a single one. In contrast to 29 , in the present work the high-density SNP chip was partitioned into sub-datasets of the same dimension, which were used as input to identify the discriminant SNPs.

Table 6. The name, identification code, geographic origin, size of samples (N), minor allele frequency (MAF), and observed heterozygosity (HO) of different horse breeds (Validation dataset). The mean value of the MAF statistic over all samples was estimated at 0.2585, and the minimum and maximum MAF were observed in the Franches Montagnes and Thoroughbred breeds, respectively (0.2195 and 0.3758).

Feature selection: Garson and Olden. The Weights method (Garson approach) was described by Garson 47 and later modified by Goh 48 . It was used to identify the relative importance of input variables from the calculated connection weights in a supervised neural network. The Garson approach expresses relative importance values as absolute magnitudes ranging from zero to one (0-1). Olden and Jackson 49 proposed the Connection weights method, also known as the Olden approach, which was also used in this research.
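To make the two weight-based measures concrete, the sketch below computes them for a network with one hidden layer and a single output unit. It follows one commonly cited formulation of each method and is not taken from the packages used in the study: for Garson, the absolute products of input-hidden and hidden-output weights are normalized within each hidden unit, summed per input, and rescaled to sum to one; for Olden, the signed products are summed per input. The names `W_in`, `w_out`, `garson_importance` and `olden_importance` are hypothetical.

```python
import numpy as np

def garson_importance(W_in, w_out):
    """Garson-style relative importance, scaled to the range 0-1.

    W_in:  (n_inputs, n_hidden) input-hidden weights.
    w_out: (n_hidden,) hidden-output weights for a single output unit.
    """
    W_in = np.asarray(W_in, dtype=float)
    w_out = np.asarray(w_out, dtype=float)
    contrib = np.abs(W_in * w_out)            # |w_ij * v_j| for each connection
    share = contrib / contrib.sum(axis=0)     # share of input i within hidden unit j
    total = share.sum(axis=1)                 # sum the shares over hidden units
    return total / total.sum()                # importances sum to 1

def olden_importance(W_in, w_out):
    """Olden (connection weights) importance: signed sum of w_ij * v_j over j."""
    return np.asarray(W_in, dtype=float) @ np.asarray(w_out, dtype=float)

W_in = [[1.0, -2.0],
        [3.0, 4.0]]
w_out = [0.5, -1.0]
print(garson_importance(W_in, w_out))
print(olden_importance(W_in, w_out))
```

The Garson values are unsigned and lie between zero and one, whereas the Olden values keep the sign of each input's net contribution, which is one reason the two rankings can differ.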
Feature selection: DNN approach and its architecture. For the DNN approach, an ANN with two hidden layers was used to identify the discriminant SNPs within breeds. Many combinations exist for selecting the number of nodes in the hidden layers 50 . After testing a range of combinations, the optimal numbers of nodes in the first and second hidden layers were found to be 40 and 38, respectively. Finally, the ANN used with the Garson and Olden algorithms contained 40 nodes in its hidden layer.
We used the final fitted weights of the neural network for selecting the genetic markers. In the DNN approach, we assumed a linear relationship between the variables and the response 12 . We considered the SNP markers to have a direct relationship with the horse breeds (Eq. 1):

$$Y = Xg + e \qquad (1)$$

where Y is the matrix of observed values for the desired breeds, g is a vector of weights of SNP markers, and e is the vector of residual terms. X is known as the design matrix that relates the elements of g to the corresponding elements of Y. Assuming that higher coefficient values in this (regression) equation have a significant effect on the output variable, the absolute maximum weight obtained by the DNN led to the selection of the SNP markers that caused the diversity of the breeds.

DNN Approach
Input:
1. Convert group labels to numbers to present to the neural network (creating the output matrix; dimensions of the matrix: 795*48000)
2. Delete markers (columns) with unknown values
3. Divide the marker matrix into smaller matrices
4. Design the network with the following layers:
   Input layer
   The first hidden layer
   The second hidden layer
   Output layer
Output: The small set of estimated coefficients from the first hidden layer, used to find the effective markers.
Steps:
1. For each data set, the network was executed and the weights of the first hidden layer were stored.
2. In the end, all the estimated weights for each variable, which were equal to 40, were obtained and a matrix with specific dimensions was made (48000*40).
3. The absolute value of all entries of the weight matrix was calculated, so that the negative sign of some of them would not cause future problems.
4. The maximum value was obtained for each marker from the obtained weights.
5. If the value obtained from the formula mentioned in the text was greater than the threshold, then that marker was selected as an effective variable.
6. A small set of markers that are more effective than the rest was extracted from the original data.

Figure 6 shows the whole analysis process. The features must be determined according to Eq. (2) after the convergence of the neural network (Fig. 6). Feature selection is based on the absolute value of the weights of the first hidden layer. It should be noted that 40 weights were calculated for each variable. In this step, the maximum value is obtained for each variable. If the obtained value was greater than the coefficient of Eq. (2), then that variable was selected as an effective SNP marker.
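The post-training part of the DNN approach (partitioning the markers, then Steps 3 to 5) can be sketched as below. The network training itself is not shown: the first-hidden-layer weights are assumed to have been fitted already and stacked into one matrix with a row per marker and a column per hidden node. Because Eq. (2) is not reproduced in the text, the selection threshold is passed in as a plain number; `partition_columns`, `select_markers` and the example values are hypothetical.

```python
import numpy as np

def partition_columns(n_markers, block_size):
    """Yield column slices that split the marker matrix into smaller blocks."""
    for start in range(0, n_markers, block_size):
        yield slice(start, min(start + block_size, n_markers))

def select_markers(first_layer_weights, threshold):
    """Select markers whose largest absolute first-layer weight exceeds a threshold.

    first_layer_weights: (n_markers, n_hidden) weights collected from all blocks.
    threshold: cut-off value (assumed given; Eq. (2) is not reproduced here).
    """
    abs_w = np.abs(np.asarray(first_layer_weights, dtype=float))  # Step 3
    max_w = abs_w.max(axis=1)                                     # Step 4
    selected = np.flatnonzero(max_w > threshold)                  # Step 5
    return selected, max_w

# Column blocks for a hypothetical panel of 10 markers, 4 markers per block.
blocks = list(partition_columns(10, 4))

# Stand-in weights for 3 markers and 2 hidden nodes (not fitted values).
weights = [[0.1, -0.9],
           [0.2, 0.3],
           [-0.5, 0.4]]
selected, max_w = select_markers(weights, threshold=0.45)
print(blocks, selected, max_w)
```

In the setting described in the text, the weight matrix would have 40 columns, one per node of the first hidden layer, and the selected indices would be used to extract the reduced marker set from the original genotype matrix (Step 6).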
By considering Eq. (2), it is assumed that all variables act with their maximum potential. Then, a selection threshold was defined to choose a small set of variables. As previously described, in this setting the effects of all variables are not estimated equally, and we see minimum and maximum values among them. The reason for assuming maximum potential is that we do not know the actual effect of each variable in biological data. Therefore, we considered every marker on the same level and allowed them to make their inferences.

Individual assignment analysis.
There are several available approaches for genetic assignment 31,51,52 .
The method of Paetkau, et al. 31 was used for the assignment analysis (as described by 38 ), and it is highly effective for individual assignment when there are high levels of genetic differentiation between reference populations 52 . It is noteworthy that SNP markers were used instead of microsatellites. We calculated the log-likelihood ratios (LLR) to assess the performance of the assignment procedure. The LLR is calculated by comparing the probability of an individual being assigned to its real population with the probability of it being assigned to another population (Eqs. 3 and 4).
Different stringency thresholds were applied as confidence levels of assignment precision. Four stringency levels were used: LLR > 1, 2, 3 & 4, which means a multi-locus genotype should be 10, 100, 1000 & 10000 times more likely to originate from the true population than from the other one. If a calculated LLR value was lower than the selected stringency level, the individual genotype failed to be assigned to its unique origin. In other words, it would be assigned to the pseudo reference population. An individual genotype was correctly assigned to its known origin when the calculated LLR was greater than the selected stringency level.
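The LLR test described above can be illustrated with a short sketch. Since Eqs. 3 and 4 are not reproduced in the text, the example rests on explicit assumptions: genotype likelihoods are computed from population allele frequencies under Hardy-Weinberg proportions, the LLR is the base-10 logarithm of the likelihood in the true population divided by the likelihood in the alternative population, allele frequencies are kept away from 0 and 1 by an arbitrary floor of 0.01, and genotypes contain no missing values. All function names are hypothetical.

```python
import numpy as np

def allele_frequencies(G, floor=0.01):
    """Frequency of the counted allele per marker, clipped away from 0 and 1."""
    p = np.nanmean(np.asarray(G, dtype=float), axis=0) / 2.0
    return np.clip(p, floor, 1.0 - floor)

def log10_genotype_likelihood(g, p):
    """log10 likelihood of a complete 0/1/2 multi-locus genotype (Hardy-Weinberg)."""
    g = np.asarray(g)
    p = np.asarray(p, dtype=float)
    probs = np.where(g == 0, (1 - p) ** 2,
                     np.where(g == 1, 2 * p * (1 - p), p ** 2))
    return float(np.sum(np.log10(probs)))

def llr(g, p_true, p_other):
    """log10 of L(true population) / L(other population)."""
    return log10_genotype_likelihood(g, p_true) - log10_genotype_likelihood(g, p_other)

def is_assigned(g, p_true, p_other, stringency):
    """True when the LLR exceeds the chosen stringency level (1, 2, 3 or 4)."""
    return llr(g, p_true, p_other) > stringency

genotype = np.array([2, 2])
p_true = np.array([0.9, 0.9])
p_other = np.array([0.1, 0.1])
value = llr(genotype, p_true, p_other)
print(value, is_assigned(genotype, p_true, p_other, 3),
      is_assigned(genotype, p_true, p_other, 4))
```

In this toy case the likelihood ratio is 6561, so the individual passes the LLR > 3 level but fails the most stringent LLR > 4 level, which is the kind of outcome tallied when success rates are reported per stringency level.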
The aim of evaluating a classification model is to evaluate and understand its flexibility, behavior, and prediction ability in dealing with new or unknown samples.

Ethical Committee of the Canton of Bern (BE33/07, BE58/10 and BE10/13)). No commercial animals were used in this study. Written informed client consent describing the purpose and duration of the study, procedures, potential risks and benefits, and containing study contact information, was obtained from private owners.

Data availability
Training Data-set: All SNP genotype data are available at the NAGPR Community Data Repository (animalgenome.org) for the purpose of reconstructing the analyses. The only exception is the data collected from the Tennessee Walking Horse, which, under agreement from the granting agency (to the University of Minnesota from the Foundation for the Advancement of the Tennessee Walking Show Horse (FAST) and the Tennessee Walking Horse Foundation (TWHF)), is only available under a Material Transfer Agreement (MTA) between interested individuals and the University of Minnesota. Testing Data-set: Whole genome sequences are available in the following NCBI BioProjects: PRJEB14779, PRJNA273402, and PRJEB10098. Additional sequences are restricted in availability due to pre-existing material transfer agreements and can be requested by contacting the contributing investigator in Additional file 1: Table S1. Genotypes for horses on the MNec2M array will be released upon publication. Genome positions for all 23 million discovered SNPs have been submitted to dbSNP as well as the European Variation Archive.